Importing All the libraries

Import the CSV Data as Pandas DataFrame

Top 5 records of the dataset

Shape of the dataset

Check Datatypes in the dataset

Summary of the dataset
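
The loading and first-look steps listed above could look roughly like the sketch below. The original code cells are not available here, so the inline CSV text is invented toy data standing in for the real file, and only some of the dataset's columns are shown.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV file; these rows are invented for illustration.
csv_text = """case_id,continent,no_of_employees,yr_of_estab,unit_of_wage,prevailing_wage,case_status
C1,Asia,1200,2005,Year,85000.0,Certified
C2,Europe,45,1998,Hour,32.5,Denied
C3,Asia,30000,1975,Year,61000.0,Certified
"""

# With a real file this would be pd.read_csv("<path to the CSV file>").
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows of the dataset (up to 5)
print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # data type of each column
print(df.describe())  # summary statistics of the numeric columns
```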

3. EXPLORING DATA

Insights

case_id has a unique value for each row, so it carries no predictive information and can be dropped. The continent column is heavily skewed towards Asia, so the remaining categories can be combined into a single category. unit_of_wage appears to be an important column, as most of the contracts are yearly.
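
These observations can be checked with a few pandas calls. The sketch below uses a small invented DataFrame, so the percentages it prints are not the dataset's real figures.

```python
import pandas as pd

# Invented toy data; only the column names follow the text.
df = pd.DataFrame({
    "case_id": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Africa"],
    "unit_of_wage": ["Year", "Year", "Year", "Year", "Year", "Hour"],
})

# An ID column has exactly one distinct value per row.
id_is_unique = df["case_id"].is_unique

# Percentage of rows in each category, largest first.
continent_share = df["continent"].value_counts(normalize=True) * 100
wage_unit_share = df["unit_of_wage"].value_counts(normalize=True) * 100

print(id_is_unique)
print(continent_share.round(1))
print(wage_unit_share.round(1))
```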

Univariate Analysis

Univariate analysis is the analysis of a single variable; the prefix “uni” means “one.” Its purpose is to understand the distribution of values for that variable.

Other types of analysis are:

Bivariate analysis: the analysis of two variables.
Multivariate analysis: the analysis of two or more variables.

Insights:

The requires_job_training, unit_of_wage, full_time_position and continent columns each have a single dominating category. The remaining columns are balanced.

Multivariate Analysis:

Multivariate analysis is the analysis of more than one variable.

Insights

There is no multicollinearity between any variables
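
One common way to look for multicollinearity between numeric features is a pairwise correlation matrix. The sketch below is an assumption about how such a check might be done, not the notebook's original code. The data is randomly generated and the 0.8 cut-off is an arbitrary illustrative threshold.

```python
import numpy as np
import pandas as pd

# Randomly generated, mutually independent toy columns (not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "no_of_employees": rng.lognormal(7, 1, 200),
    "yr_of_estab": rng.integers(1950, 2016, 200),
    "prevailing_wage": rng.lognormal(11, 0.5, 200),
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Hypothetical cut-off: treat |r| above 0.8 as a sign of multicollinearity.
threshold = 0.8
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print(high_pairs)  # an empty list means no pair exceeds the threshold
```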

Check Multicollinearity for Categorical features

Reports

Here requires_job_training fails to reject the null hypothesis, which means it is not significantly associated with the target column.
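
A chi-squared test of independence between each categorical feature and the target can be sketched as follows, assuming SciPy's chi2_contingency is used. The counts are invented so that one feature is independent of the target and the other is not. The column name has_job_experience and the 0.05 significance level are assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented cell counts: (requires_job_training, has_job_experience, case_status, n).
cells = [
    ("N", "Y", "Certified", 35), ("N", "N", "Certified", 5),
    ("Y", "Y", "Certified", 15), ("Y", "N", "Certified", 5),
    ("N", "Y", "Denied", 7), ("N", "N", "Denied", 13),
    ("Y", "Y", "Denied", 3), ("Y", "N", "Denied", 7),
]
rows = [(t, e, s) for t, e, s, n in cells for _ in range(n)]
df = pd.DataFrame(
    rows, columns=["requires_job_training", "has_job_experience", "case_status"]
)

alpha = 0.05  # assumed significance level
p_values = {}
for col in ["requires_job_training", "has_job_experience"]:
    # Contingency table of feature categories against target classes.
    table = pd.crosstab(df[col], df["case_status"])
    stat, p_value, dof, expected = chi2_contingency(table)
    p_values[col] = p_value
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{col}: p = {p_value:.4f} -> {decision}")
```

Failing to reject the null hypothesis only means the test found no evidence of association in the sample; it does not prove the feature is irrelevant.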

Checking Null Values

Reports

There are no missing values.
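
A minimal sketch of such a null-value check, on an invented DataFrame:

```python
import pandas as pd

# Invented toy data with no missing entries.
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia"],
    "prevailing_wage": [85000.0, 32.5, 61000.0],
})

# Number of missing entries in each column, and in the whole frame.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```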

Initial Analysis Report

no_of_employees is right-skewed and has many outliers, which can be handled during feature engineering. yr_of_estab is left-skewed, with some outliers below the lower bound of the box plot. prevailing_wage is right-skewed, with outliers above the upper bound of the box plot. There are no missing values in the dataset. The case_id column can be dropped because its value is unique for every row. The case_status column is the target to predict. Categorical features can be converted to binary numerical features during feature encoding.
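
Skewness and box-plot outliers can also be checked numerically. The sketch below uses the usual 1.5 × IQR whisker rule on a small invented series. It is an illustration of the idea, not the notebook's original code.

```python
import pandas as pd

# Invented values with one extreme observation on the right.
s = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 25, 400], name="no_of_employees")

# Positive skewness indicates a right-skewed distribution, negative a left-skewed one.
skewness = s.skew()

# Box-plot bounds: 1.5 * IQR below the first and above the third quartile.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Observations outside the bounds are flagged as outliers.
outliers = s[(s < lower) | (s > upper)]

print(skewness)
print(lower, upper)
print(outliers.tolist())
```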

4. Visualization

4.1 Visualize the Target Feature

The chart shows that the target variable is imbalanced.

What is imbalanced data?

Imbalanced data is data in which the target classes have an uneven distribution of observations. Here, the Certified class has more observations than the Denied class, which is consistent with the certification rates reported in the sections below.
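
Class balance can be quantified from the relative frequency of each target class. In the sketch below the counts are invented and do not reflect the real dataset.

```python
import pandas as pd

# Invented target column: 7 rows of one class, 3 of the other.
status = pd.Series(["Certified"] * 7 + ["Denied"] * 3, name="case_status")

# Fraction of rows in each class.
class_share = status.value_counts(normalize=True)

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(round(imbalance_ratio, 2))
```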

4.2 Does the applicant's continent have any impact on visa status?

Insight:

As per the chart, more applicants come from Asia than from any other continent. 43% of certified applications are from Asia, followed by Europe with 11%. The chance of being certified is highest for applicants from Europe, followed by Africa.
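
Two different percentages are involved here: each continent's share of all certified applications, and the certification rate within each continent. A normalized crosstab separates them, as in this sketch on invented data.

```python
import pandas as pd

# Invented toy data: Asia has the most applicants, Europe the best rate.
df = pd.DataFrame({
    "continent": ["Asia"] * 6 + ["Europe"] * 3 + ["Africa"],
    "case_status": (
        ["Certified"] * 3 + ["Denied"] * 3   # Asia: 3 of 6 certified
        + ["Certified"] * 2 + ["Denied"]     # Europe: 2 of 3 certified
        + ["Denied"]                         # Africa: 0 of 1 certified
    ),
})

# Row-normalized: certification rate within each continent (rows sum to 100).
rate_within_continent = (
    pd.crosstab(df["continent"], df["case_status"], normalize="index") * 100
)

# Column-normalized: each continent's share of a given status (columns sum to 100).
share_of_status = (
    pd.crosstab(df["continent"], df["case_status"], normalize="columns") * 100
)

print(rate_within_continent.round(1))
print(share_of_status.round(1))
```

The same pattern applies to the other categorical features (education, work experience, job training, unit of wage, region) by swapping the first column name.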

4.3 Does the applicant's education have any impact on visa status?

Reports

Education has a high impact. Applicants with a Doctorate or Master's degree have a higher chance of being certified than the others.

4.4 Does the applicant's previous work experience have any impact on visa status?

Of the applicants with previous work experience, 74.5% were certified and only 25.5% were denied. Of the applicants with no previous work experience, 56% were certified and 43% were denied. This means work experience has an effect on visa status. As expected, applicants with work experience have a slight edge over freshers, but the difference is not huge.

4.5 If the employee requires job training, does it have any impact on visa status?

Report

Whether the employee requires job training has no real effect on visa status. 88% of applicants don't require job training. 63% of the applicants who don't require job training were certified, compared with 67% of those who do. The chi-squared test showed that this feature has little impact on the target variable, and the plot above confirms it.

4.6 Does the employer's number of employees have any impact on visa status?

Insights

The distribution is similar for both classes, but both contain outliers that need to be handled.

4.7 Wages and their impact on visa status

Report

Of the applicants with hourly pay, 65% were denied. Applications with a yearly unit of wage were certified 69% of the time and denied 31% of the time. Yearly contracts have the highest chance of certification, closely followed by weekly and monthly contracts.

4.7 Does the region of employment have an impact on visa status?

Report

As per the chart, all regions show a very similar pattern of certified and denied visas. There is a slight edge for the Midwest, followed by the South region.

4.8 Does the prevailing wage have any impact on visa status?

Insights

The distribution is the same for both classes, but the outliers need to be handled.

4.8.1 Prevailing wage based on Education

Report

Based on the above table and charts, applicants with a Master's education have higher average prevailing wages. Master's applicants have a median salary of $78.8k, which is approximately 20% higher than the Doctorate applicants' average wage, which is strange.
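
A per-education wage table of this kind can be produced with a groupby aggregation. In the sketch below the wage values are invented, and the column name education_of_employee is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Invented toy data; "education_of_employee" is a hypothetical column name.
df = pd.DataFrame({
    "education_of_employee": ["Master's"] * 3 + ["Doctorate"] * 3,
    "prevailing_wage": [70000, 80000, 90000, 60000, 65000, 70000],
})

# Mean and median prevailing wage for each education level.
wage_by_education = (
    df.groupby("education_of_employee")["prevailing_wage"].agg(["mean", "median"])
)

print(wage_by_education)
```

Comparing the median of one group with the mean of another can mislead when the wage distribution is skewed, so it helps to look at both statistics side by side.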

4.9 Year of Establishment

Report

Each bin spans 5 years. Many companies were established after the year 2000. The largest number of companies were established in 2000-2005.
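
Five-year bins like these can be built with pd.cut. The years below are invented, and the choice of left-closed bins such as [2000, 2005) is an assumption, since the text does not say which edge is inclusive.

```python
import pandas as pd

# Invented establishment years.
years = pd.Series([1987, 1999, 2000, 2001, 2003, 2004, 2007, 2012], name="yr_of_estab")

# Bin edges every 5 years, covering the full range of the data.
start = int(years.min() // 5 * 5)
stop = int((years.max() // 5 + 1) * 5)
bin_edges = list(range(start, stop + 1, 5))

# Left-closed bins: [1985, 1990), [1990, 1995), ...
binned = pd.cut(years, bins=bin_edges, right=False)

# Number of companies per bin, in chronological order.
counts = binned.value_counts().sort_index()
busiest_bin = counts.idxmax()

print(counts)
print(busiest_bin)
```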

Final Report

The case_id column can be dropped as it is an ID. The requires_job_training column can be dropped as it doesn't have much impact on the target variable, as shown by the visualization and the chi-squared test. The no_of_employees and prevailing_wage columns have outliers, which should be handled. The continent column has a few categories with very low counts, which can be grouped as Others. The target column case_status is imbalanced, which can be handled before model building.
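
Two of these follow-up steps, dropping columns and grouping rare continent categories, can be sketched as below. The data is invented, and the 15% cut-off for a "rare" category is a hypothetical threshold chosen for the example.

```python
import pandas as pd

# Invented toy data.
df = pd.DataFrame({
    "case_id": [f"C{i}" for i in range(10)],
    "continent": ["Asia"] * 6 + ["Europe"] * 2 + ["Oceania", "Africa"],
    "requires_job_training": ["N"] * 9 + ["Y"],
    "case_status": ["Certified"] * 7 + ["Denied"] * 3,
})

# Drop the ID column and the feature judged uninformative.
df = df.drop(columns=["case_id", "requires_job_training"])

# Categories below the (hypothetical) minimum share are merged into "Others".
min_share = 0.15
share = df["continent"].value_counts(normalize=True)
rare_categories = share[share < min_share].index
df["continent"] = df["continent"].where(
    ~df["continent"].isin(rare_categories), "Others"
)

print(df["continent"].value_counts())
```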

Handling Missing values

There are no missing values.